Pretty obvious answer of K=2 there.
For K=2, I plotted the proportion of each sample’s genome that was assigned to either K1 or K2:
Given the high amounts of missingness in some of the regions, I wanted to make sure that missingness doesn’t predict a sample’s assignment to one of the two populations. The plot below shows the frequency of missing genotypes (y-axis) across samples (x-axis); each sample is colored by the proportion of the genome assigned to K2:
## [1] 0.01368808
## [1] 0.009160282
## [1] 0.007722963
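A check like this can also be done numerically by correlating per-sample missingness with ancestry assignment. A minimal base-R sketch on simulated data (the objects `geno` and `k2` are placeholders, not the actual analysis objects):

```r
set.seed(1)

# Simulated genotype matrix: 100 samples x 500 SNPs, with ~5% missing calls
geno <- matrix(rbinom(100 * 500, 2, 0.3), nrow = 100)
geno[sample(length(geno), 0.05 * length(geno))] <- NA

# Simulated K2 ancestry proportions for the same 100 samples
k2 <- runif(100)

# Per-sample missingness: fraction of NA genotypes in each row
miss <- rowMeans(is.na(geno))

# If missingness predicted assignment, this correlation would be large
cor(miss, k2)
```

Here the correlation should be near zero, since missingness was simulated independently of ancestry.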
Regions of interest:

- PC1: Two hits on chrm2 () and one on chrm11
- PC2: Region upstream of dhps on chrm8 and aat1 on chrm6
- PC3: dhps, crt
I used the randomForest package to conduct supervised machine learning, classifying samples by their North/South location:
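The workflow is roughly the following (a sketch, not the exact code: the SNP matrix, the `loc2` labels, and the train/test split are simulated stand-ins, and the model only fits if the randomForest package is installed):

```r
set.seed(1)

# Simulated stand-in data: 60 samples x 50 SNPs plus a North/South label
snps <- as.data.frame(matrix(rbinom(60 * 50, 2, 0.4), nrow = 60))
snps$loc2 <- factor(rep(c("North", "South"), each = 30))

# Simple train/test split
train_idx <- sample(nrow(snps), 40)
train <- snps[train_idx, ]
test  <- snps[-train_idx, ]

if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(loc2 ~ ., data = train,
                                   importance = TRUE, ntree = 500, mtry = 10)
  # Confusion matrix on the held-out samples
  print(table(predict(rf, test), test$loc2))
} else {
  message("randomForest not installed; skipping model fit")
}
```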
##
## Call:
## randomForest(formula = loc2 ~ ., data = train, importance = TRUE, ntree = 10000, mtry = 200, proximity = TRUE, type = classification)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 200
##
## OOB estimate of error rate: 19.89%
## Confusion matrix:
## North South class.error
## North 117 50 0.2994012
## South 20 165 0.1081081
## predictions.test.data
## North South
## North 40 16
## South 4 58
## [[1]]
## [1] 0.9012097
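For reference, an overall statistic can be read off a confusion matrix like the one above. A base-R sketch with toy counts (made up, not the counts above; the `[[1]]` value printed above comes from the author's own helper):

```r
# Toy confusion matrix (rows = truth, cols = prediction); counts are made up
cm <- matrix(c(40, 5, 10, 45), nrow = 2,
             dimnames = list(truth = c("North", "South"),
                             pred  = c("North", "South")))

# Overall accuracy: correct calls on the diagonal over all calls
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # (40 + 45) / 100 = 0.85
```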
We can slightly improve this by tuning the rf parameters, but it’s nothing to write home about.
Here are the SNPs associated with the north/south divide, as well as some QC:
##
## Call:
## randomForest(formula = pop ~ ., data = train, importance = TRUE, ntree = 10000, mtry = 200, type = classification)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 200
##
## OOB estimate of error rate: 8.68%
## Confusion matrix:
## 0 1 class.error
## 0 138 34 0.19767442
## 1 14 367 0.03674541
## prediction.test.data
## 0 1
## 0 45 12
## 1 8 119
## [[1]]
## [1] 0.9386656
TO DO: fix seed issue

## Some additional checks
How about treating K as a continuous variable (i.e., using regression instead of classification)?
##
## Call:
## randomForest(formula = K1 ~ ., data = train, importance = TRUE, ntree = 1000, mtry = 200)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 200
##
## Mean of squared residuals: 0.03148774
## % Var explained: 74.04
## tree variable minimal_depth
## 1 1 chr1_472576 5
## 2 1 chr1_537427 34
## 3 1 chr10_1041222 21
## 4 1 chr10_1065121 6
## 5 1 chr10_1128179 7
## 6 1 chr10_1268080 5
## 7 1 chr10_1391660 15
## 8 1 chr10_1554604 7
## 9 1 chr10_375664 16
## 10 1 chr10_464500 12
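The table above (columns tree/variable/minimal_depth) looks like per-tree minimal-depth output, e.g. from randomForestExplainer's `min_depth_distribution()` — an assumption, since the code isn't shown. Variable importance can also be pulled straight from a fitted forest; a guarded sketch on simulated stand-in data:

```r
set.seed(1)

# Simulated stand-in: continuous response (like a K1 ancestry proportion)
dat <- as.data.frame(matrix(rnorm(80 * 20), nrow = 80))
dat$K1 <- plogis(rowMeans(dat[, 1:3]) + rnorm(80, sd = 0.2))

if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(K1 ~ ., data = dat, ntree = 300,
                                   importance = TRUE)
  # Top variables by permutation importance (%IncMSE for regression forests)
  imp <- randomForest::importance(rf)
  print(head(imp[order(-imp[, "%IncMSE"]), ]))
} else {
  message("randomForest not installed; skipping")
}
```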
##
## Call:
## randomForest(formula = loc2 ~ ., data = train, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 40
##
## OOB estimate of error rate: 14.22%
## Confusion matrix:
## North South class.error
## North 79 24 0.23300971
## South 8 114 0.06557377
## prediction.test.data
## North South
## North 27 7
## South 3 38